library(tidyverse)This blogpost was concieved as a short look at monte carlo simulation, but turned into both musings on some health economic decision principles, and discussion of the advantages and drawbacks of using AI.
1 Purpose
1.1 A little intro text
Inspired by Fooled by Randomness by Nassim Taleb, I decided to try doing some Monte Carlo simulations of different potential life paths ahead of me.
Monte Carlo simulations are not foreign to me, as they are frequently used in health economic models to take uncertainty into account when presenting a comparison between different health interventions or strategies.
The way that is usually done is by having point estimates and confidence intervals - or credibility intervals if you are bayesian, which those types of analyses mostly are - for certain outcomes, both in terms of their costs and effects, associated with each option being considered.
These estimates and intervals are then sampled randomly in simulations around at least 1000 times, to provide estimates of which option is better than the other, in probability clouds.
These probability clouds represent the spread of the different possible outcomes, given the uncertainty associated with the confidence intervals.
Imagine two dots on a plot, each representing the cost and effects of the interventions. One seems to be in better place than the other. Is the intervention with seemingly the best cost/effect ratio compared to the other always the better choice? Not necessarily.
With the monte carlo simulation, you can see thousands of dots of the two different colours. The clouds may overlap, or the cloud representing the seemingly worse intervention might have more estimates that seem to be in more advantagous spots.
Advantageous how? Well, more of the estimates may be in the “much higher effect, at a sligthly higher cost” than the single point estimate mentioned earlier. Visualising how the possible estimates spread out, in many different “worlds” or scenarios, can help decision-makers determine what is likely to be the best option, at least much better than a single point estimate, and a range of uncertainty with two numbers.
So. That would be nice to try to transfer to different life choices, As right now, I am looking for a job, and hey, who knows what kind of job I can get.
1.2 Distilled purpose
We will start simple with a comparison of 3 different life paths from on here:
- Public sector job
- Private sector job
- Start own business / “Startup”
Not to conclude what is best, but to see what the possible consequences can be.
This is of course ridiculously simplified, as I am likely to switch jobs more than once after the initial choice. That could be the theme of a different post - a decision tree or markov model where spending time in different states is possible.
1.3 Analysis plan
In brief:
- 3 different options
- 20 year into the future
- 1000 simulations of “life paths”
Because the tool is available and I would be insane not to use it, I use GPT to help me generate the template, and then learn how to change it from there to fit my needs.
Also, I learn how it works, preferably by both searching for different ways of doing it online.
2 Process
No need to load and clean data, as I will generate it myself.
2.1 Setup
Just need the tidyverse package for this. Might try adding health economic packages like heemod, dampak, or hesim in the future. Maybe there is a tidyverse like package collection for health economic packages?
2.1.1 Parameters
Years, simulations, options as a tibble.
# basic parameters
n_years <- 20
n_sim <- 1000
# decisions/choices
decisions <- tibble(
decision = c("Private", "Public", "Startup"),
startup_prob_success = c(NA, NA, 0.30),
# all moneys in 1000 DKK/year
startup_income_success = c(NA, NA, 6000),
base_income = c(600, 400, 50),
growth_rate_income = c(0.02, 0.02, 0.03),
std_dev_income = c(10, 5, 300),
base_content = c(3.2, 3.5, 3.0),
content_sd = c(0.4, 0.3, 0.9),
burnout_prob = c(0.10, 0.05, 0.25),
health_shock_prob = c(0.05, 0.05, 0.15)
)2.1.2 Simulation function
First the simulation function for one simulation, for one decision.
simulate_one_life <- function(decision_row, sim_id) {
# if dec=startup, calculate prob of success, if not, NA
has_success <- ifelse(decision_row$decision == "startup",
rbinom(1, 1, decision_row$startup_prob_success), NA)
# Income, for startup it depends on success
years <- 1:n_years
income_base <- decision_row$base_income * (1 + decision_row$growth_rate_income)^(years - 1)
# for the gamma distribution thats necesary for startups (no 0)
income <- case_when(
decision_row$decision == "Startup" & !is.na(has_success) & has_success == 1 ~
rep(decision_row$startup_prob_success, n_years),
decision_row$decision == "Startup" ~ {
# I assume gamma because income even for startup is probably from 0 and above
shape <- (500 / 100)^2 #mean/sd
rate <- 500 / 100^2 #mean/sd
rgamma(n_years, shape = shape, rate = rate)
},
TRUE ~
# For the rest, it follows their respective distribution parameters
rnorm(n_years, mean = income_base, sd = decision_row$std_dev_income)
)
# Yearly chance events
health_shock <- rbinom(n_years, 1, decision_row$health_shock_prob)
burnout <- rbinom(n_years, 1, decision_row$burnout_prob)
# Parenting - which affects the contentness
# First child: already present (baseline) - dont consider
# Second child: guaranteed to arrive within first 5 years
kid2_year <- sample(2:6, 1)
# Parenting flag for the second
parenting2 <- ifelse((1:n_years) >= kid2_year, 1, 0)
# Contentness
content <- rnorm(n_years, decision_row$base_content, decision_row$content_sd) -
0.4 * burnout - 0.6 * health_shock + 0.3 * parenting2
# Scale must be between 1 (very not content), to 5 (very content)
content <- pmin(5, pmax(1, content))
# Savings of 10% each year, minus healthshock prob * its cost
savings <- cumsum(income * 0.1 - 5 * health_shock) # health schock cost in 1000's too
# Standardising a total hapiness score. each one contributes 20-40% to it
# happiness <- 0.4 * scale(income)[, 1] +
# 0.4 * scale(content)[, 1] +
# 0.2 * scale(savings)[, 1]
# make tibble for output
tibble(
year = 1:n_years,
sim = sim_id,
decision = decision_row$decision,
income,
content,
burnout,
health_shock,
parenting2,
savings
)
}2.2 Output
2.2.1 Run and summarize
For each decision, we simulate 1000 iterations of what would happen over 20 years.
results <- map(1:nrow(decisions), function(i) {
decision_row <- decisions[i, ]
map(1:n_sim, ~simulate_one_life(decision_row, .x)) |>
list_rbind()
}) |> list_rbind()It literally only takes seconds to simulate three diferent life paths (20 years), 1000 times. Holy moley. But, I know from experience that including more variables or more simulations increases the time non-linearly.
Now, to get some point estimates for each decision, we can summarise the results by grouping the simulations results by simulation number, and decision and calculating some averages, and the last of the savings.
Then, we group by each decision, and then just take the mean of each value, to get a point estimate for each decision.
The mean of all simulations for each decision. Mean of mean of the possible values, if you will.
summary <- results |>
group_by(sim, decision) |>
summarise(
avg_income = mean(income),
avg_content = mean(content),
burnout_rate = mean(burnout),
health_event_rate = mean(health_shock),
end_savings = last(savings),
.groups = "drop"
) |>
select(-sim) |> # <- remove sim so it's not re-summarized
group_by(decision) |>
summarise(across(everything(), list(mean = mean, sd = sd)), .groups = "drop")
print(summary)# A tibble: 3 × 11
decision avg_income_mean avg_income_sd avg_content_mean avg_content_sd
<chr> <dbl> <dbl> <dbl> <dbl>
1 Private 729. 2.29 3.39 0.0980
2 Public 486. 1.14 3.71 0.0782
3 Startup 499. 22.8 3.07 0.206
# ℹ 6 more variables: burnout_rate_mean <dbl>, burnout_rate_sd <dbl>,
# health_event_rate_mean <dbl>, health_event_rate_sd <dbl>,
# end_savings_mean <dbl>, end_savings_sd <dbl>
2.2.2 Visualize
Now for perhaps the most helpful part of this: the raw visualisation of the possible future lifepath outcomes.
results |>
ggplot(aes(x = income, fill = decision)) +
geom_density(alpha = 0.4) +
labs(
title = "Life Income Distribution by Decision Path",
x = "Total Income in 1000 DKK over 30 Years", y = "Density") +
theme_minimal()
results |>
ggplot(aes(x = content, fill = decision)) +
geom_density(alpha = 0.4) +
labs(
title = "Life Income Distribution by Decision Path",
x = "Total Income in 1000 DKK over 30 Years", y = "Density") +
theme_minimal()
results |>
ggplot(aes(x = income, y = content, color = decision)) +
geom_point(alpha = 0.2) +
geom_point(
data = summary,
aes(x = avg_income_mean, y = avg_content_mean, group = decision),
color = "black",
shape = 18) +
geom_text(
data = summary,
aes(x = avg_income_mean, y = avg_content_mean, label = decision),
vjust = -1,
inherit.aes = FALSE
) +
labs(
title = "Life Income Distribution by Decision Path - Point Estimates Added",
x = "Total Income in 1000 DKK over 30 Years") +
theme_minimal()
ggsave("thumbnail.jpg", plot = last_plot(), width = 6, height = 4) # saved as thumbnailAs the pure scatter plots are a bit wild on the eyes, I added the point estimates from the previous summary, for each decision.
So there. A truckload of simulations of arbitrary lifepaths with relatively arbitrary data feeding it. The following can be seen at a glance:
- The public sector job decision sits with relatively high contentness at lower income
- The private sector job has much higher income, at slightly more dispersed contentness
- The startup path is mired in such great uncertainty, that its cloud takes up a lot of space. It has both the highest and lowest income and contentness, with many simulations at the highest and lowest contentment levels.
I will skip visualising the savings for now.
What is more important to visualise is the effects besides the income and contentness, that is:
- How much burnout and health schock are associated with each decision?
results |>
group_by(decision, year) |>
summarise(burnout_rate = mean(burnout), .groups = "drop") |>
ggplot(aes(x = year, y = burnout_rate, color = decision)) +
geom_line(linewidth = 1) +
labs(
title = "Burnout Rate by Decision Over Time",
y = "Proportion Burned Out",
x = "Year"
) +
theme_minimal()
results |>
group_by(decision, year) |>
summarise(shock_rate = mean(health_shock), .groups = "drop") |>
ggplot(aes(x = year, y = shock_rate, color = decision)) +
geom_line(linewidth = 1) +
labs(
title = "Annual Health Shock Rate by Decision",
y = "Proportion with Health Shock",
x = "Year"
) +
theme_minimal()
Around twice the burnout rate and three times the health shock rate for the Startup decision relative to the others.
So the simulations accurately represents (because the inputs accurately represents), the high risk high reward style of starting a company.
3 Summary / cleaned up / refactored
So, to summarise.
I created three imaginary life paths that I could take and assumed that I would stick to the path chosen for the simulation period.
Then I defined some different base values and uncertanties in the form of standard deviations and probabilities, for the life paths, in a little data frame.
These were then put into a simulation function that defined what would occur for each simulation, with differences for each path.
Then the simulation function was applied to the life path dataframe, for each decision, for 20 years, 1000 times.
And the results were visualised with density plots, scatter plots for content. With these arbitrary inputs, not much can be gleaned from such a simulation, other than the fact that uncertainty plays a huge role and one can have a lot of different outcomes. Perhaps based on this one could decide whether one felt
3.1 A Note on AI
AI (ChatGPT) played a role in this blog in the following ways:
- Carrying my idea into practice (writing the main code)
- Changing the code
The amount of time I saved from thinking my way through this is enormous. Which leads to the critical question asked often these days:
What have I lost in terms of learning, with the short term gain in efficiency from using AI?
This applies to both point 1 and 2 above.
For both points, it is obvious that I have lost something. It has been established by many researchers that slow and deliberate practice fosters better brain development and learning than just absorbing summaries or cramming a bunch of information in your head in a short amount of time.
From my memory, the following has been shown to be very beneficial for learning:
- Pen and paper (in most scenarios)
- Slow, deliberate learning with clear goals for which progress can be tracked
- Spaced repitition (learn about a thing.. let it sit a bit.. bring it to your memory again.. wait longer.. bring it to your memory, rinse, repeat)
- Explaining it in your own terms, and to others (hitching on to the famous “Feynman Technique”)
- Doing it in a location and environment that specifically fits you, but probably mostly with distractions removed
- Short intense bursts, and then breaks with NOTHING to do but relaxing (not on phone or SOME)
However, is there perhaps also an obvious benefit to using AI? For example, how different is AI from , at least parts of, the mentor-mentee relationship?
While a real mentor provides the following:
- Personalised feedback and guidance on:
- Progress
- Strengths and weaknesses
- Career advice
- Field specific expert level knowledge transfer
- Tacit knowledge transfer
- Motivation
- Public approval and recommendations to others
- Trust and support
An AI mentor can in large part provide the same, where, I believe, the following is significantly different, in terms of what the AI can do:
- Much quicker access to knowledge
- Poor deep brainstorming and not precise progress tracking
- A real expert will know exactly how much a person knows, and when to prompt different levels or stages of knowledge
- Tracking of progress is fine until AI hallucinations occur
- Much weaker transfer of tacit knowledge
- Weaker motivation
- The text version of “You’ve got it!” does not hit nearly the same way as the real life, honest, and enthusiastic version
- No recommendation and trust
- AI can’t abuse you (bad mentor)
Or, as ChatGPT puts it:
What it can’t provide are the richer social-emotional cues, contextual wisdom and professional sponsorship that come from a flesh-and-blood mentor embedded in your world. Treat the model as a tireless practice partner and sounding board, not a full substitute, and you’ll capture the best of both worlds.
In summary, I believe AI will raise the floor of learning, making it resemble plot B more than plot A. See below.
df1 <- data.frame(
a = as.numeric(rnorm(1000, 100, 15)),
b = as.numeric(rnorm(1000, 100, 25))
)
ggplot(data = df1) +
geom_density(aes(x=a), color = "red") +
geom_density(aes(x=b), color = "blue") +
labs(x = NULL,y = NULL) + theme_minimal()
I set the mean around 100 to make it resemble IQ.
That is, I believe:
- It will be a good tool to gain access to knowledge for people who were either too lazy to access it or were prevented in some way before
- There are some who will overly rely on it and suffer the consequences of losing some of their ability to think for themselves
- There are some who will recognise its potential and downsides, and use it when it is appropriate, soring about others
The last group is likely the ones who are least likely, at least in the white-collar jobs, to be replaced by AI.
Up until AI is more “Agentic”, ofcourse. By then I have no idea what will happen. Some smart people are thinking about this, though.
3.2 Refactored code
Below you can find the entire code, refactored a bit.
I used the following prompt, which worked fine although it actually made some errors that made me have to remake some of it.
First, please take out all chunks, and refactor them in the following way:
- make code more concise, with functions for repeating lines or plots
- remove comments
- remove unecessary code
library(tidyverse)
n_years <- 20
n_sim <- 1000
decisions <- tibble(
decision = c("Private", "Public", "Startup"),
startup_prob_success = c(NA, NA, 0.30),
# all moneys in 1000 DKK/year
startup_income_success = c(NA, NA, 6000),
base_income = c(600, 400, 50),
growth_rate_income = c(0.02, 0.02, 0.03),
std_dev_income = c(10, 5, 300),
base_content = c(3.2, 3.5, 3.0),
content_sd = c(0.4, 0.3, 0.9),
burnout_prob = c(0.10, 0.05, 0.25),
health_shock_prob = c(0.05, 0.05, 0.15)
)
simulate_one_life <- function(decision_row, sim_id) {
has_success <- ifelse(decision_row$decision == "startup",
rbinom(1, 1, decision_row$startup_prob_success), NA)
years <- 1:n_years
income_base <- decision_row$base_income * (1 + decision_row$growth_rate_income)^(years - 1)
income <- case_when(
decision_row$decision == "Startup" & !is.na(has_success) & has_success == 1 ~
rep(decision_row$startup_prob_success, n_years),
decision_row$decision == "Startup" ~ {
shape <- (500 / 100)^2 #mean/sd
rate <- 500 / 100^2 #mean/sd
rgamma(n_years, shape = shape, rate = rate)
},
TRUE ~ rnorm(n_years, mean = income_base, sd = decision_row$std_dev_income)
)
health_shock <- rbinom(n_years, 1, decision_row$health_shock_prob)
burnout <- rbinom(n_years, 1, decision_row$burnout_prob)
kid2_year <- sample(2:6, 1)
parenting2 <- ifelse((1:n_years) >= kid2_year, 1, 0)
content <- rnorm(n_years, decision_row$base_content, decision_row$content_sd) -
0.4 * burnout - 0.6 * health_shock + 0.3 * parenting2
content <- pmin(5, pmax(1, content))
savings <- cumsum(income * 0.1 - 5 * health_shock)
tibble(
year = 1:n_years,
sim = sim_id,
decision = decision_row$decision,
income,
content,
burnout,
health_shock,
parenting2,
savings
)
}
run_decision <- function(row, reps) {
map(seq_len(reps), ~simulate_one_life(row, .x)) |> list_rbind()
}
results <- map(seq_len(nrow(decisions)),
~run_decision(decisions[.x, ], n_sim)) |> list_rbind()
summary <- results |>
group_by(sim, decision) |>
summarise(
avg_income = mean(income),
avg_content = mean(content),
burnout_rate = mean(burnout),
health_rate = mean(health_shock),
end_savings = last(savings),
.groups = "drop"
) |>
select(-sim) |>
group_by(decision) |>
summarise(across(everything(), list(mean = mean, sd = sd)), .groups = "drop")
plot_density <- function(dat, var){
ggplot(dat, aes({{ var }}, fill = decision)) +
geom_density(alpha = 0.4) +
theme_minimal()
}
plot_scatter <- function(dat, sumdat){
ggplot(dat, aes(income, content, colour = decision)) +
geom_point(alpha = 0.2) +
geom_point(data = sumdat,
aes(avg_income_mean, avg_content_mean),
colour = "black", shape = 18, inherit.aes = FALSE) +
geom_text(data = sumdat,
aes(avg_income_mean, avg_content_mean, label = decision),
vjust = -1, inherit.aes = FALSE) +
theme_minimal()
}
plot_rate <- function(dat, var, ttl){
ggplot(dat %>% group_by(decision, year) %>%
summarise(rate = mean({{ var }}), .groups = "drop"),
aes(year, rate, colour = decision)) +
geom_line(linewidth = 1) +
labs(title = ttl) +
theme_minimal()
}
plot_density(results, income)plot_density(results, content)plot_scatter(results, summary)plot_rate(results, burnout, "Burnout Rate")plot_rate(results, health_shock, "Health-Shock Rate")